fix: Add schema validation for native_datafusion Parquet scan by vaibhawvipul · Pull Request #3759 · apache/datafusion-comet

vaibhawvipul · 2026-03-22T08:28:44Z

When spark.comet.scan.impl=native_datafusion, DataFusion's Parquet reader silently coerces incompatible types instead of erroring like Spark does.

Which issue does this PR close?

Closes #3720.

Rationale for this change

DataFusion is more permissive than Spark when reading Parquet files with mismatched schemas. For example, reading an INT32 column as bigint, or TimestampLTZ as TimestampNTZ, silently succeeds in DataFusion but should throw SchemaColumnConvertNotSupportedException per Spark's behavior. This breaks correctness guarantees that Spark users rely on.

What changes are included in this PR?

Adds schema compatibility validation in schema_adapter.rs:

validate_spark_schema_compatibility() checks each logical field against its physical counterpart when a file is opened
is_spark_compatible_read() defines the allowlist of valid Parquet-to-Spark type conversions (matching TypeUtil's logic)
Incompatible reads now produce errors in "Column: [name], Expected: <type>, Found: <type>" format
Correctly allows INT96→LTZ (DataFusion coerces INT96 to NTZ) and Timestamp→Int64 (nanosAsLong)
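The allowlist approach described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Ty` enum and `Field` struct are simplified stand-ins for Arrow's types (assumptions made so the example is self-contained), and the allowed and rejected pairs are taken only from the cases named in this description. The real function signatures in schema_adapter.rs may differ.

```rust
/// Simplified stand-in for Arrow/Parquet types. This is an assumption for
/// illustration only, NOT the real `arrow` DataType.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Int32,
    Int64,
    UInt32,
    Float32,
    Float64,
    Binary,
    Utf8,
    TimestampNtz,
    TimestampLtz,
    Decimal { precision: u8, scale: i8 },
}

/// Hypothetical field type: a column name plus its simplified type.
struct Field {
    name: &'static str,
    ty: Ty,
}

/// Allowlist check: any (physical, logical) pair not listed is rejected.
fn is_spark_compatible_read(physical: &Ty, logical: &Ty) -> bool {
    use Ty::*;
    if physical == logical {
        // Identical types, including decimals with equal precision and scale.
        return true;
    }
    matches!(
        (physical, logical),
        (Binary, Utf8)
            | (UInt32, Int64)
            | (TimestampNtz, TimestampLtz) // also covers INT96 coerced to NTZ
            | (TimestampNtz, Int64)        // nanosAsLong
            | (TimestampLtz, Int64)
    )
}

/// Checks each logical field against the physical field of the same name and
/// reports the first mismatch in the error format quoted in the description.
fn validate_spark_schema_compatibility(
    logical: &[Field],
    physical: &[Field],
) -> Result<(), String> {
    for lf in logical {
        // In this sketch, columns missing from the file are simply skipped.
        if let Some(pf) = physical.iter().find(|p| p.name == lf.name) {
            if !is_spark_compatible_read(&pf.ty, &lf.ty) {
                return Err(format!(
                    "Column: [{}], Expected: {:?}, Found: {:?}",
                    lf.name, lf.ty, pf.ty
                ));
            }
        }
    }
    Ok(())
}
```

With these stand-in types, validating a logical `id: Int64` against a physical `id: Int32` returns an error, while a logical `Utf8` column backed by a physical `Binary` column passes. An allowlist fails closed: a conversion nobody has considered is rejected rather than silently coerced.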

How are these changes tested?

parquet_int_as_long_should_fail - SPARK-35640: INT32 read as bigint is rejected
parquet_timestamp_ltz_as_ntz_should_fail - SPARK-36182: TimestampLTZ read as TimestampNTZ is rejected
parquet_roundtrip_unsigned_int - UInt32→Int32 (existing test, still passes)
test_is_spark_compatible_read - unit test covering compatible cases (Binary→Utf8, UInt32→Int64, NTZ→LTZ, Timestamp→Int64) and incompatible cases (Utf8→Timestamp, Int32→Int64, LTZ→NTZ, Utf8→Int32, Float→Double, Decimal precision/scale mismatches)


andygrove · 2026-03-22T15:47:40Z

Thanks for working on this @vaibhawvipul. This looks like a good start. Note that the behavior does vary between Spark versions. Spark 4 is much more permissive, for example.

Could you add end-to-end integration tests, ideally using the new SQL-file-based testing approach or Scala tests that compare Comet and Spark behavior?

andygrove · 2026-03-22T16:57:22Z

@vaibhawvipul you need to run "make format" to fix lint issues

vaibhawvipul · 2026-03-22T17:05:37Z

@vaibhawvipul you need to run "make format" to fix lint issues

Thank you. Fixed.

comphead · 2026-03-22T20:16:01Z

I'm unsure how we should proceed, given that Spark 4.0 supports widening type coercion. Would it be better to just document that Comet allows coercion in such cases on Spark 3.x? 🤔

vaibhawvipul · 2026-03-23T04:14:40Z

I'm unsure how we should proceed, given that Spark 4.0 supports widening type coercion. Would it be better to just document that Comet allows coercion in such cases on Spark 3.x? 🤔

This is simpler for sure. We can document that Comet is more permissive than Spark 3.x. However, this PR keeps validation for clearly invalid cases regardless of the Spark version.

The validation isn't trying to be stricter than Spark 3.x - it's preventing DataFusion from silently producing wrong results for genuinely incompatible types.
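One way to picture the distinction drawn in this thread is the sketch below. It is purely illustrative: `ReadKind`, `should_reject`, and the `allow_widening` flag are hypothetical names that do not appear in this PR, and how Comet actually gates behavior by Spark version is not shown here.

```rust
/// Hypothetical classification of a Parquet-to-Spark read (illustrative only).
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum ReadKind {
    /// File type and requested type are the same.
    Exact,
    /// A lossless widening, e.g. Int32 read as Int64.
    Widening,
    /// A genuinely incompatible read, e.g. Utf8 read as Int32.
    Invalid,
}

/// Returns true when the read should fail.
/// `allow_widening` stands in for a more permissive mode (assumed here to
/// correspond to newer Spark versions); invalid reads fail regardless.
fn should_reject(kind: &ReadKind, allow_widening: bool) -> bool {
    match kind {
        ReadKind::Exact => false,
        ReadKind::Widening => !allow_widening,
        ReadKind::Invalid => true,
    }
}
```

Under this sketch, only the widening case depends on the permissiveness setting; incompatible reads are rejected in both modes.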

vaibhawvipul added 2 commits March 22, 2026 13:51

Add schema validation for native_datafusion Parquet scan

c94efed

When spark.comet.scan.impl=native_datafusion, DataFusion's Parquet reader silently coerces incompatible types instead of erroring like Spark does.

Merge branch 'main' into issue-3720

911a358

Add integration test for schema mismatch validation

9dff312

vaibhawvipul added 2 commits March 22, 2026 22:31

Fix Spotless formatting violation in ParquetReadSuite

3aac6f3

Handle RunEndEncoded Arrow types in schema validation

92f2454

Allow valid type widenings and Int64-Timestamp conversions in schema validation

848ffa3

Merge branch 'main' into issue-3720

d8ab81a

vaibhawvipul wants to merge 7 commits into apache:main from vaibhawvipul:issue-3720
